Briefings in Bioinformatics

Oxford University Press (OUP)

Preprints posted in the last 30 days, ranked by how well they match Briefings in Bioinformatics's content profile, based on 11 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit.

1
Survival risk heterogeneity among patients with NSCLC receiving nivolumab visualized by risk scores generated from deep learning method DeepSurv using tumor gene mutations

Nishiyama, N.

2026-02-22 oncology 10.64898/2026.02.15.26346303
Top 0.3%
39× avg

Immunotherapy with immune checkpoint inhibitors, alone or combined with chemotherapy, is a promising treatment for NSCLC patients and has prolonged survival. However, the majority of patients with advanced NSCLC have a poor prognosis. Biomarkers that stratify responders and non-responders to immune checkpoint inhibitors help unravel the mechanisms of the immune checkpoint pathway and the immune-tumor interactions underlying response, and are urgently needed to improve clinical outcomes of immune checkpoint inhibitor treatment. In this study, we analyzed the clinical and gene mutation data of NSCLC patients treated with nivolumab-containing immunotherapy, with or without chemotherapy (the immunotherapy-treated group, n=119), and with chemotherapy alone (the chemotherapy-alone group, n=991), extracted from the MSK CHORD dataset. A DeepSurv model, a deep learning-based extension of the Cox proportional hazards model, was trained to generate a survival risk score for each patient, with the binary statuses of thirty-one gene mutations as input features. The thirty-one genes were selected based on population-level mutation frequency, patient-level variance in mutation status, and univariate Cox proportional hazards analyses evaluating the association between the presence or absence of each gene mutation and overall survival. The performance of the trained DeepSurv model was evaluated on the test set of the immunotherapy-treated group using the concordance index (C-index). The trained model was subsequently applied without retraining to the entire chemotherapy-alone group as a control. The resulting C-indexes for the immunotherapy-treated group and the chemotherapy-alone group were 0.789 and 0.483, respectively. All patients within each group were divided into high- and low-risk groups according to the median predicted risk score. 
Kaplan-Meier survival curves of the high- and low-risk groups (n=43 vs n=70) in the immunotherapy-treated group showed a significant separation (log-rank p<0.001), whereas no separation was observed in the chemotherapy-alone group (p=0.62). In the combined cohort of both groups, the interaction between the DeepSurv-derived risk score and treatment modality was significant (HR for interaction 1.47, 95% CI 1.32 to 1.65, p<0.005), suggesting that the predictive value of the DeepSurv-derived risk score is specific to immunotherapy. Principal component analysis and permutation importance analysis, performed as complementary analyses to assess individual genes associated with the DeepSurv-derived risk score, identified ZFHX3, SMARCA4, ALK, BTK, and NOTCH2 as major contributors to survival risk stratification. Collectively, we suggest that the nonlinear coupling pattern of the 31 tumor gene mutation statuses in the DeepSurv model captures the heterogeneity of survival risk among NSCLC patients treated with nivolumab-containing immunotherapy, with or without chemotherapy, which was visualized as a clear separation between high-risk and low-risk groups divided by the median risk score.
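The two evaluation steps this abstract relies on, a concordance index and a median split of predicted risk, can be illustrated with a minimal sketch. It uses Harrell's C as the concordance measure, which is an assumption (the abstract does not say which estimator was used), and the toy times, events, and risk scores are invented for illustration only.

```python
import numpy as np

def concordance_index(time, event, risk):
    """Harrell's C: share of comparable pairs in which the patient with the
    shorter observed survival also has the higher predicted risk."""
    time, event, risk = map(np.asarray, (time, event, risk))
    concordant, comparable = 0.0, 0
    for i in range(len(time)):
        if not event[i]:
            continue  # a pair is comparable only if the earlier time is an observed event
        for j in range(len(time)):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied risk scores count as half
    return concordant / comparable

def median_split(risk):
    """Boolean mask: True for patients whose risk score is above the median."""
    risk = np.asarray(risk)
    return risk > np.median(risk)

# Hypothetical toy cohort (months, event indicator, predicted risk score).
time = [5, 8, 12, 20, 25, 30]
event = [1, 1, 0, 1, 0, 0]
risk = [2.1, 1.7, 0.9, 0.2, 0.3, 0.1]

c_index = concordance_index(time, event, risk)
high_risk = median_split(risk)
print(round(c_index, 3), high_risk.tolist())
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering. Note that the strict `>` comparison assigns scores tied at the median to the low-risk group, so the two groups need not be the same size.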

2
Constructing a Literature-Derived Database for Benchmarking Polygenic Risk Score Construction Methods with Spectral Ranking Inferences

Sebastian, C.; Yu, M.; Jin, J.

2026-03-03 genetic and genomic medicine 10.64898/2026.03.01.26347258
Top 0.4%
30× avg

Polygenic risk scores (PRSs) have emerged as a valuable tool for genetic risk prediction and stratification in human diseases. Over the past decade, extensive methodological efforts have focused on improving the predictive power of PRS, leading to the development of numerous methods for PRS construction. Benchmarking these methods is therefore essential for guiding future PRS applications. While studies have benchmarked subsets of these methods on specific phenotypes and cohorts, the resulting evidence remains fragmented, and no work comprehensively assesses the relative performance of the various PRS methods. In this study, we addressed this gap by systematically constructing a PRS method benchmarking database synthesizing published results from 2009 to 2025. We applied a spectral ranking inference framework with uncertainty quantification to rank 14 PRS methods that had been adequately compared against each other in the literature. We constructed rankings using two complementary sources: original method-development studies and applications/benchmarking studies. While the highest-ranked methods (LDpred2 and AnnoPred) and the lowest-ranked method (C+T) were consistently identified from both sources, the relative ordering of most methods showed moderate variability. We further constructed phenotype-specific rankings, providing more detailed insights into the robustness and phenotype-specific strengths of individual methods. Collectively, the overall and phenotype-specific rankings of the PRS methods, along with the curated benchmarking data from the literature, provide a dynamic and practical reference database that can be continually updated with emerging PRS methods and published benchmarking results to guide future PRS applications.
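To give a feel for how a spectral method turns pairwise comparisons into a ranking, here is a minimal sketch of Rank Centrality, one well-known spectral ranking algorithm. Treating it as a stand-in for the preprint's framework is an assumption: the abstract does not name the algorithm, and the uncertainty quantification it mentions is omitted here. The win counts are invented.

```python
import numpy as np

def rank_centrality(wins, n_iter=2000):
    """wins[i, j] = number of comparisons in which method i outperformed method j.
    Returns the stationary distribution of a random walk that moves from a
    method towards the methods that beat it; higher score = better rank."""
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    total = wins + wins.T
    # Fraction of the i-vs-j comparisons that j won (0 where never compared).
    frac_lost = np.divide(wins.T, total, out=np.zeros_like(wins), where=total > 0)
    d_max = max(int((total > 0).sum(axis=1).max()), 1)
    P = frac_lost / d_max
    np.fill_diagonal(P, 0.0)
    np.fill_diagonal(P, 1.0 - P.sum(axis=1))  # self-loops make each row sum to 1
    pi = np.full(n, 1.0 / n)
    for _ in range(n_iter):  # power iteration towards the stationary distribution
        pi = pi @ P
    return pi / pi.sum()

# Hypothetical comparison counts among three methods A, B, C.
wins = [[0, 8, 9],
        [2, 0, 7],
        [1, 3, 0]]
scores = rank_centrality(wins)
print(np.argsort(-scores).tolist())  # method indices from best to worst
```

The sketch assumes the comparison graph is connected, so that the stationary distribution is unique; methods that were never compared with the rest could not be placed on a common scale.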

3
An Integrated Deep Learning Framework for Small-Sample Biomedical Data Classification: Explainable Graph Neural Networks with Data Augmentation for RNA sequencing Dataset

Guler, F.; Goksuluk, D.; Xu, M.; Choudhary, G.; Agraz, M.

2026-02-24 genetic and genomic medicine 10.64898/2026.02.22.26346827
Top 0.5%
21× avg

Applying deep learning models to RNA-Seq data poses substantial challenges, primarily due to the high dimensionality of the data and the limited sample sizes. To address these issues, this study introduces an advanced deep learning pipeline that integrates feature engineering with data augmentation. The engineering application focuses on biomedical engineering, specifically the classification of RNA-Seq datasets for disease diagnosis. The proposed framework was initially validated on synthetic datasets generated from Naive Bayes, where MLP-based augmentation yielded a notable improvement in predictive performance. Building on this foundation, we applied the approach to chromophobe renal cell carcinoma (KICH) RNA-Seq data from The Cancer Genome Atlas (TCGA). Following standard preprocessing steps (normalization, transformation, and dimensionality reduction), the analysis concentrated on three main aspects: augmentation strategies, preprocessing methods, and explainable AI (XAI) techniques in relation to classification outcomes. Feature selection was performed through PCA, Boruta, and RF-based methods. Three augmentation strategies (linear interpolation, SMOTE, and MixUp) were evaluated. To maintain methodological rigor, augmentation was applied exclusively to the training set, while the test set was held out for unbiased evaluation. Within this framework, we conducted a comparative assessment of multiple deep learning architectures, including MLP, GNN, and the recently proposed Kolmogorov-Arnold networks (KAN). The GNN achieved the highest classification accuracy (99.47%) and the best F1 score (0.9948) when trained with MixUp augmentation combined with RF feature selection. Consequently, the GNN-based XAI framework was applied to the RF dataset enriched with MixUp. 
XAI analyses identified the top 20 most influential genes, such as HNF4A, DACH2, MAPK15, and NAT2, which played the greatest role in classification, thereby confirming the biological plausibility of the model outputs. To further validate model robustness, cervical cancer and Alzheimer's RNA-Seq datasets were also tested, yielding consistent and reliable results. Overall, the findings highlight the value of incorporating data augmentation into deep learning models for RNA-Seq analysis, not only to improve predictive performance but also to enhance biological interpretability through explainable AI approaches.
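The point about augmenting only the training set can be made concrete with a minimal MixUp sketch. The data here are random stand-ins for an expression matrix, and the function name, `alpha` value, and split sizes are illustrative assumptions rather than the preprint's settings.

```python
import numpy as np

def mixup(X, y_onehot, n_new, alpha=0.2, seed=0):
    """Create n_new synthetic samples as convex combinations of random
    pairs of training samples, mixing features and labels identically."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X), size=n_new)
    j = rng.integers(0, len(X), size=n_new)
    lam = rng.beta(alpha, alpha, size=(n_new, 1))  # mixing weight per new sample
    X_new = lam * X[i] + (1.0 - lam) * X[j]
    y_new = lam * y_onehot[i] + (1.0 - lam) * y_onehot[j]
    return X_new, y_new

# Hypothetical small dataset: 20 samples, 5 selected features, 2 classes.
rng = np.random.default_rng(42)
X = rng.normal(size=(20, 5))
y = np.eye(2)[rng.integers(0, 2, size=20)]

# Split FIRST, then augment the training portion only.
X_train, y_train = X[:15], y[:15]
X_test, y_test = X[15:], y[15:]

X_aug, y_aug = mixup(X_train, y_train, n_new=30)
X_train_aug = np.vstack([X_train, X_aug])
y_train_aug = np.vstack([y_train, y_aug])
print(X_train_aug.shape, X_test.shape)
```

Splitting before augmenting matters because a synthetic sample mixed from a test-set sample would leak test information into training and inflate the reported accuracy.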

4
Retrospective evaluation of human genetic evidence for clinical trial success using Mendelian randomization and machine learning

Ravarani, C. N. J.; Arend, M.; Baukmann, H. A.; Cope, J. L.; Lamparter, M. R. J.; Sullivan, J. K.; Fudim, R.; Bender, A.; Malarstig, A.; Schmidt, M. F.

2026-02-23 pharmacology and therapeutics 10.64898/2026.02.19.26346536
Top 0.6%
20× avg

Human genetics has become a cornerstone of drug target discovery, yet the value of Mendelian randomization (MR) for predicting clinical success remains uncertain. Here, we systematically evaluated MR across 11,482 target-indication pairs with documented Phase II clinical outcomes to assess its utility for drug development. We find that MR statistical significance alone does not enrich for Phase II success, in contrast to genome-wide association study (GWAS) support, which confers an increase in success probability. However, this apparent limitation reflects the heterogeneous nature of clinical failure and the fact that MR encodes information beyond P values. When MR-derived features, including instrument strength and explained variance, are integrated into machine learning models, predictive performance improves substantially. An MR-informed XGBoost classifier identifies target-indication pairs with a 55% overall approval rate, corresponding to a 6.4-fold enrichment over unstratified programs and a 2.8-fold improvement over GWAS-supported targets in Phase II. Notably, this enrichment is achieved without reliance on statistically significant MR results. Our findings demonstrate that MR is most informative when treated as a graded, context-dependent source of causal evidence rather than a binary hypothesis test, and that its integration with machine learning enables scalable, genetics-informed prioritization of drug targets across the clinical pipeline.
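The enrichment figures can be unpacked with a short back-of-the-envelope calculation. It assumes that each fold value is a simple ratio of approval rates measured on the same outcome definition; the abstract does not spell this out, so the implied baselines below are an inference, not reported results.

```python
# Reported in the abstract.
approval_rate = 0.55          # approval rate among classifier-selected pairs
fold_vs_unstratified = 6.4    # enrichment over unstratified programs
fold_vs_gwas = 2.8            # improvement over GWAS-supported targets

# Implied comparison rates, ASSUMING fold = ratio of approval rates.
implied_unstratified = approval_rate / fold_vs_unstratified
implied_gwas = approval_rate / fold_vs_gwas

print(f"implied unstratified approval rate: {implied_unstratified:.1%}")  # 8.6%
print(f"implied GWAS-supported approval rate: {implied_gwas:.1%}")        # 19.6%
```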

5
Genome-Wide Significance Reconsidered: Low-Frequency Variants and Regulatory Networks in Autism

Mendes de Aquino, M.; Engchuan, W.; Thompson, S.; Zhou, X.; Safarian, N.; Chen, D. Z.; Trost, B.; Salazar, N. B.; Ma, C.; Thiruvahindrapuram, B.; Vorstman, J.; Scherer, S. W.; Breetvelt, E.

2026-02-12 genetic and genomic medicine 10.64898/2026.02.11.26346090
Top 0.7%
19× avg

Low-frequency variants (LFVs), defined by minor allele frequencies (MAF) of 1-5%, occupy the gap between common and rare variants in both frequency and effect size. The conventional genome-wide association study (GWAS) significance threshold (5x10^-8) is overly conservative for LFVs, which account for more than 25% of variants in GWAS. This limitation may obscure meaningful associations in highly heritable yet genetically complex disorders such as autism spectrum disorder (ASD). We hypothesize that the scarcity of significant LFVs in ASD GWAS reflects statistical constraints rather than a true lack of association. To address this, we derived a MAF-specific genome-wide significance threshold using linkage disequilibrium-informed simulations applied to ASD GWAS summary statistics, identifying 2.03x10- as optimal. Applying this threshold revealed three novel LFVs mapping to zinc finger proteins (ZNF420, ZNF781) and known ASD-related genes (KMT2E, PRKDC, MCM4). Enrichment analyses suggested their function in nervous system development and gene regulation. Our findings highlight the contribution of LFVs to ASD risk and underscore the importance of frequency-aware association strategies.

6
Benchmarking HLA genotyping from whole-genome sequencing across multiple sequencing technologies

Cremin, C.; Elavalli, S.; Paulin, L.; Arres Reche, J.; Saad, A. A. Y. A.; Attia, A.; Minas, C.; Aldhuhoori, F.; Katagi, G.; Wu, H.; Sidahmed, H.; Mafofo, J.; Soliman, O.; Behl, S.; Pariyachery, S.; Gupta, V.; Ghanem, D.; Sajjad, H.; Cardoso, T.; El-Khani, A.; Al Marzooqi, F.; Magalhaes, T.; Sedlazeck, F. J.; Quilez, J.

2026-02-12 health informatics 10.64898/2026.02.10.26345621
Top 0.7%
19× avg

Background: The hyperpolymorphic nature and structural complexity of the human leukocyte antigen (HLA) genomic region present challenges for accurate and scalable typing across diverse sample types. While whole-genome sequencing (WGS) offers the opportunity to infer HLA genotypes without targeted enrichment, systematic benchmarks across sequencing platforms, biospecimens and coverage levels remain limited. Results: We assembled a multi-platform resource of WGS datasets derived from short-read (Illumina, MGI) and long-read (Oxford Nanopore Technologies R9 and R10) sequencing, spanning 29 biospecimens including cell lines, blood, buccal swab and saliva. We evaluated the performance of the HLA caller HLA*LA across 13 HLA genes, using a clinically validated assay as reference. WGS-based HLA genotyping achieved ~95% accuracy across sequencing platforms, with Class I loci exhibiting higher accuracy than Class II. Cross-platform concordance was high, and performance remained consistent across Illumina, MGI and Oxford Nanopore chemistries. Analysis of blood, buccal swab and saliva samples showed that blood and buccal swabs supported accurate HLA inference, whereas saliva yielded reduced concordance. Downsampling experiments demonstrated that 15x coverage was sufficient to retain >95% accuracy at two-field resolution, with lower depths supporting lower-resolution typing. Conclusions: Our results demonstrate that WGS provides a robust, platform-agnostic framework for accurate HLA genotyping across sample types and coverage levels. These benchmarks establish practical conditions for reliable HLA inference and underscore the utility of WGS for population-scale HLA analyses and future clinical applications.
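What "accuracy at two-field resolution" means in practice can be shown with a small sketch that truncates allele names to a chosen number of fields before comparing calls against a reference. The scoring rule (allele-level matches per locus, ignoring allele order) and the example genotypes are assumptions for illustration; the preprint's exact scoring scheme is not described in the abstract.

```python
from collections import Counter

def truncate_fields(allele, n_fields=2):
    """'HLA-A*02:01:01:01' -> 'A*02:01' for n_fields=2."""
    gene, _, fields = allele.partition("*")
    if gene.startswith("HLA-"):
        gene = gene[4:]
    return gene + "*" + ":".join(fields.split(":")[:n_fields])

def concordance(calls, truth, n_fields=2):
    """Fraction of reference alleles matched by the calls at the given
    resolution, comparing the two alleles of each locus as an unordered pair."""
    matched = total = 0
    for key, true_pair in truth.items():
        called = Counter(truncate_fields(a, n_fields) for a in calls.get(key, ()))
        expected = Counter(truncate_fields(a, n_fields) for a in true_pair)
        matched += sum((called & expected).values())  # multiset intersection
        total += sum(expected.values())
    return matched / total

# Hypothetical reference typing and WGS-based calls, keyed by (sample, gene).
truth = {
    ("S1", "A"): ("A*02:01:01:01", "A*24:02:01"),
    ("S1", "DRB1"): ("DRB1*15:01:01", "DRB1*04:01:01"),
}
calls = {
    ("S1", "A"): ("HLA-A*24:02:01:01", "HLA-A*02:01:01"),
    ("S1", "DRB1"): ("DRB1*15:01:01", "DRB1*04:05:01"),
}

print(concordance(calls, truth, n_fields=2))  # 0.75
print(concordance(calls, truth, n_fields=1))  # 1.0
```

In this toy example the DRB1*04:05 call disagrees with the reference at two fields but agrees at one field, which mirrors the abstract's observation that lower-resolution typing tolerates conditions under which higher-resolution typing fails.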

7
Federated penalized piecewise exponential model for horizontally distributed survival data: FedPPEM

Islam, N.; Luo, C.; Tong, J.; Polleya, D. A.; Jordan, C. T.; Haverkos, B.; Bair, S.; Kent, A.; Weller, G.

2026-02-12 health informatics 10.64898/2026.02.11.26346054
Top 0.8%
17× avg

Cox proportional hazard regressions are frequently employed to develop prognostic models for time-to-event data, considering both patient-specific and disease-specific characteristics. In high-dimensional clinical modeling, these biological features can exhibit high collinearity due to inter-feature relationships, potentially causing instability and numerical issues during estimation without regularization. For rare diseases such as acute myeloid leukemia (AML), the sparsity and scarcity of data further complicate estimation. In such cases, data augmentation through multi-site collaboration can alleviate these problems. However, this often necessitates sharing individual patient data (IPD) across sites, which presents challenges due to regulatory barriers aimed at protecting patient privacy. To overcome these challenges, we propose a privacy-preserving algorithm that eliminates sharing IPD across sites and fits a federated penalized piecewise exponential model (FedPPEM) to estimate potential effects of clinical features using summary statistics. This algorithm yields results nearly identical to those from pooled IPD, including effect size and standard error estimates. We demonstrate the model's performance in quantifying effects of clinical features and genetic risk classification on overall survival using real-world data from ~1200 newly diagnosed AML patients across 33 U.S. sites. Although applied in the AML context, this model is disease-agnostic and can be implemented in other diseases and clinical contexts.
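Why a piecewise exponential model lends itself to federation can be seen in its simplest form. The sketch below is a deliberately stripped-down assumption: it drops the covariates and the penalty that FedPPEM includes and estimates only the baseline hazard per interval, for which each site needs to share nothing more than event counts and person-time. All data and cut points are invented.

```python
import numpy as np

def site_summary(time, event, cuts):
    """Per-interval event counts and person-time for one site.
    Only these aggregates, not patient rows, would leave the site."""
    time = np.asarray(time, dtype=float)[:, None]
    event = np.asarray(event, dtype=bool)[:, None]
    lo, hi = cuts[:-1], cuts[1:]
    exposure = np.clip(time, lo, hi) - lo           # time at risk in each interval
    died_here = event & (time > lo) & (time <= hi)  # events falling in each interval
    return died_here.sum(axis=0), exposure.sum(axis=0)

cuts = np.array([0.0, 6.0, 12.0, 36.0])  # hypothetical interval boundaries (months)
site_a = ([3, 8, 15, 30], [1, 0, 1, 1])  # (follow-up times, event indicators)
site_b = ([5, 10, 20], [1, 1, 0])

# Federated estimate: sum the site-level summaries, then divide.
summaries = [site_summary(t, e, cuts) for t, e in (site_a, site_b)]
fed_events = sum(s[0] for s in summaries)
fed_exposure = sum(s[1] for s in summaries)
fed_hazard = fed_events / fed_exposure

# Pooled estimate from individual patient data, for comparison.
pooled_events, pooled_exposure = site_summary(
    site_a[0] + site_b[0], site_a[1] + site_b[1], cuts)
pooled_hazard = pooled_events / pooled_exposure

print(fed_hazard, np.allclose(fed_hazard, pooled_hazard))
```

In this covariate-free case the federated and pooled estimates agree exactly, because the maximum-likelihood hazard in each interval depends on the data only through the two sums. With covariates and a penalty the aggregates are more involved, which is where the abstract's "nearly identical" comes in.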

8
PHARMWATCH: A Multilayer Pharmacogenomics Safety System for Accurate Star Allele Interpretation

Eisenhart, C. E.; Brickey, R.; Mewton, J.

2026-02-28 genetic and genomic medicine 10.64898/2026.02.26.26347200
Top 1%
12× avg

The Clinical Pharmacogenetics Implementation Consortium (CPIC) bases its drug-gene recommendations on the assignment of star alleles, which map known genotypes to defined functional categories and corresponding drug dosage guidelines. The star allele framework, first proposed in 1996 for the CYP gene family and later formalized with CPIC's establishment in 2010 [1, 2], remains foundational to pharmacogenomics. However, this system has notable limitations. Its dependence on a restricted set of benchmark single nucleotide polymorphisms (SNPs) excludes rare or novel pathogenic variants that can invalidate a star allele call and lead to incorrect dosing recommendations. Furthermore, nearby non-pathogenic variants can interfere with haplotype interpretation, introducing additional risk of misclassification. To address these gaps, we developed PHARMWATCH, a multistep pharmacogenomics workflow for comprehensive variant analysis, allele tracking, and contextual interpretation. PHARMWATCH incorporates two algorithmic safeguards designed to identify genomic alterations that compromise star allele accuracy: (1) de novo germline variant screening using the ACMG-based BIAS-2015 classifier and (2) variant interpretation in context (VIIC) to validate the functional integrity of star allele-defining SNPs [3]. Together, these layers enhance the reliability of pharmacogenomic reporting, enabling safe, automated, and review-ready recommendations that extend beyond the constraints of traditional star allele-based approaches.

9
GPAS: an online AI system for rapid and accurate pathogen identification and LLM-based interpretation

Li, T.; Hong, H.; Fan, D.; Li, J.; Li, T.; Wu, J.; Jiang, S.; Xie, X.; Zhang, Y.; Hu, M.; Yin, X.; Zhang, Y.; Ma, H.; Liu, Z.; Su, Z.; Yu, X.; Liu, Y.; Yuan, H.; Zheng, W.; Liu, H.; Ma, M.; Li, X.; Shen, Y.; Zhang, C.; Wang, Y.; Zhao, B.; Sun, L.; Han, Q.-Y.; Chen, J.; Zhang, K.; Chen, L.; Wang, N.; Li, W.; Man, J.; He, K.; Dong, F.; Du, F.; Yi, Y.; Li, A.; Zhou, T.; Zhang, X.; Li, T.

2026-02-20 public and global health 10.64898/2026.02.18.26346517
Top 1%
11× avg

Accurate identification of unknown pathogens is critical for medicine and public health, yet current metagenomic workflows remain heavily dependent on specialized bioinformatics expertise and manual interpretation, creating substantial bottlenecks in time-sensitive diagnostic settings [1]. The key challenges lie in achieving precise species identification amidst high background noise and translating complex microbial data into clinically actionable insights [2, 3]. Here we present the Global Pathogen Analysis System (GPAS), an integrated computational framework that combines rapid and accurate pathogen identification with large language model (LLM)-based semantic interpretation. Central to GPAS is a dynamic-library alignment mechanism informed by prior probabilities of inter-species misclassification. By integrating a hybrid machine learning model that couples elastic neural networks with Bayesian inference, this approach substantially reduces both false positives and false negatives, achieving species-level accuracy superior to existing state-of-the-art tools. To enable clinical interpretation, we constructed a unified microbial knowledge graph integrating global metagenomic and metaviromic sample repositories, and trained a pathogen-specialized LLM agent. Through end-to-end reinforcement learning, the agent autonomously executes multi-step reasoning workflows, extracting pathogen-specific insights from complex data and generating human-readable, evidence-based reports. Application to throat swab samples demonstrates that GPAS not only accurately identifies pathogenic microorganisms but also reveals how SLE-associated immune dysregulation reshapes the respiratory microbiome and promotes pathobiont overgrowth, providing clinically instructive interpretations. By substantially lowering technical barriers to pathogen identification, GPAS offers an accessible yet powerful platform for clinical diagnostics, public health surveillance, and microbiome research. 
The system is freely available at: https://gpas.nh.ac.cn/.

10
Act or Defer: Error-Controlled Decision Policies for Medical Foundation Models

Jin, Y.; Moon, I.; Zitnik, M.

2026-02-26 health informatics 10.64898/2026.02.23.26346927
Top 1%
11× avg

Clinical deployment of foundation models requires decision policies that operate under explicit error budgets, such as a cap on false-positive clinical calls. Strong average accuracy alone does not guarantee safety: errors can concentrate among patients selected for action, leading to harm and inefficient use of healthcare resources. Here we introduce StratCP, a stratified conformal framework that turns foundation model predictions into decision-ready outputs through error-controlled selection and calibrated deferral. StratCP first selects a subset of patients for immediate clinical action while controlling the false discovery rate at a user-specified level. For the remaining patients, it returns prediction sets that achieve target coverage conditional on deferral, supporting confirmatory testing or expert review. When clinical guidelines define relationships among disease states, StratCP incorporates a utility graph to produce clinically coherent prediction sets without sacrificing coverage guarantees. We evaluate StratCP in ophthalmology and neuro-oncology across diagnosis, biomarker prediction, and time-to-event prognosis. Across tasks, StratCP controls the false discovery rate among selected patients and provides valid, selection-conditional coverage for deferred patients. In neuro-oncology, it enables H&E-based diagnosis under a fixed error budget, reducing reliance on reflex molecular assays and lowering laboratory cost and turnaround time. StratCP establishes error-controlled decision policies for safe deployment of medical foundation models.
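The "act" half of an act-or-defer policy can be sketched with the generic recipe of conformal p-values followed by the Benjamini-Hochberg procedure. This is not the preprint's method: the stratification, the prediction sets for deferred patients, and the utility graph are all omitted, and the scores and calibration set below are invented. The sketch assumes a calibration set of known-negative patients whose model scores are exchangeable with those of negative test patients.

```python
import numpy as np

def conformal_pvalues(calib_null_scores, test_scores):
    """p-value for H0 'this patient is negative': rank of the test score
    among calibration scores from known negatives (higher score = more positive)."""
    calib = np.sort(np.asarray(calib_null_scores, dtype=float))
    test = np.asarray(test_scores, dtype=float)
    n = len(calib)
    n_ge = n - np.searchsorted(calib, test, side="left")  # calib scores >= test score
    return (1 + n_ge) / (n + 1)

def benjamini_hochberg(pvals, q):
    """Boolean mask of rejected hypotheses (patients selected for action)."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)
    passed = pvals[order] <= q * np.arange(1, m + 1) / m
    selected = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest rank meeting its threshold
        selected[order[: k + 1]] = True
    return selected

# Hypothetical scores: 99 known-negative calibration patients, 5 new patients.
calib_null_scores = np.linspace(0.0, 0.5, 99)
test_scores = [0.95, 0.90, 0.30, 0.10, 0.85]

pvals = conformal_pvalues(calib_null_scores, test_scores)
selected = benjamini_hochberg(pvals, q=0.1)  # act on these patients
deferred = ~selected                         # send the rest for review or testing
print(selected.tolist())
```

The error budget `q` bounds the expected fraction of selected patients who are actually negative, which is a different guarantee from overall accuracy: it speaks only about the patients acted upon.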

11
Multi-Omics Integration for Identification of Prognostic Molecular Signatures for Survival Stratification in Lung Cancer

Maitra, C.; Das, V.; Seal, D. B.; De, R. K.

2026-03-02 oncology 10.64898/2026.02.28.26347335
Top 1%
11× avg

Lung cancer is characterized by profound intratumoral and inter-patient heterogeneity, spanning histological subtypes, molecular landscapes, and the tumor microenvironment. While multi-omics integration is essential for capturing this complexity, leveraging these data to explicitly define survival-associated subpopulations remains a significant challenge. In this study, we developed NeuroMDAVIS-FS, an unsupervised deep learning framework designed to stratify lung cancer patients by survival risk, and identify molecular determinants underlying improved clinical outcomes. Using the CPTAC cohort, we integrated genomic (CNV), transcriptomic (RNA-seq), and proteomic profiles to extract modality-specific features. Candidate biomarkers were validated through Kaplan-Meier (KM) survival analysis and univariate Cox proportional hazards (CoxPH) regression. A final multivariate CoxPH model effectively stratified patients into high-risk and low-risk cohorts (Kaplan-Meier p-value < 0.001). Notably, the integration of these molecular features with baseline clinical models significantly enhanced prognostic accuracy, improving the concordance index by 43.79% in LUAD, 31.05% in LSCC, and 23.76% across the pan-lung cancer cohort. These results demonstrate that NeuroMDAVIS-FS identifies robust, biologically relevant features that surpass traditional clinical variables in predicting patient outcomes, offering a scalable path for precision oncology.

12
Learning lifetime disease liability reveals and removes genetic confounding in electronic health records

Di, Y.; Cai, N.

2026-02-22 genetic and genomic medicine 10.64898/2026.02.15.26346336
Top 1%
11× avg

Electronic health records (EHRs) have become the cornerstone of population-scale genetic studies [1], but factors including patterns of healthcare use shape which diagnoses are recorded and how, leading to confounding effects in genetic associations with EHR codes [2]. In this study we propose EDGAR, a deep learning framework that recovers lifetime disease liability from EHRs by aligning diagnostic codes with clinically validated measures and disease labels in a set of individuals prioritized through active learning. EDGAR yields representations that better capture disease-specific effects in genome-wide association studies (GWAS). It also enables us to isolate a genetic factor that captures systemic biases in EHR codes, which distorts cross-disease correlations and drives spurious links with behavioral and socio-economic traits. We find that this factor generalizes across EHRs, and its identification in one EHR enables its removal from existing GWAS in another. Overall, our work presents a promising direction for improving specificity of EHR-based GWAS.

13
FA-NIVA: A Nextflow framework for automated analysis of Nanopore based long-read sequencing data for genetic analysis in Fanconi anemia

Neurgaonkar, P.; Dierolf, M.; O'Gorman, L.; Remmele, C.; Schaeffer, J.; Popp, I.; Borst, A.; Rost, S.; Ankenbrand, M.; Kratz, C.; Bergmann, A.; Kalb, R.; Yu, J.

2026-03-04 genetic and genomic medicine 10.64898/2026.02.27.26346867
Top 1%
11× avg

Motivation: Fanconi anemia (FA) is a rare disease mainly caused by biallelic pathogenic variants, including structural variants such as large deletions and insertions in FA genes. Currently, variant detection is based on short-read sequencing and probe-based approaches. However, determining the exact genomic breakpoint or achieving allelic discrimination remains challenging. Nanopore-based long-read sequencing enables a comprehensive detection of FA variants, but a unified bioinformatic analysis platform for these data is missing. Results: We present FA-NIVA (Fanconi anemia - Nanopore Indel and Variant Analysis), an automated and adaptable analysis workflow tailored for Nanopore-based long-read sequencing data in FA genetic analysis. FA-NIVA integrates state-of-the-art tools to comprehensively detect both single nucleotide variants (SNVs) and structural variants (SVs). Our analysis platform enhances genotyping accuracy for biallelic variants by a joint SNV-SV based phasing in FA-associated genes. Built within the Nextflow ecosystem and powered by containerized Docker images, FA-NIVA ensures reproducibility, flexibility, scalability and transparency across different computing environments. Together, FA-NIVA provides a robust end-to-end solution for the automated analysis of SVs and SNVs and high-resolution phasing analysis in FA genes, enabling an accurate and efficient pipeline for genetic analysis. Availability: FA-NIVA is available on GitHub at: https://github.com/UKWgenommedizin/FA-NIVA.

14
PhenoSS: Phenotype semantic similarity-based approach for rare disease prediction and patient clustering

Chen, S.; Nguyen, Q. M.; Hu, Y.; Liu, C.; Weng, C.; Wang, K.

2026-03-02 health informatics 10.64898/2026.02.26.26347219
Top 1%
11× avg

Objective: Systematic clinical phenotyping using Human Phenotype Ontology (HPO) is central to rare disease diagnosis. However, current disease prioritization methods (which rank candidate diseases for a patient from HPO terms) face key challenges: they often fail to account for the hierarchical structure of HPO terms, ignore dependencies among correlated terms, and do not adjust for batch effects arising from systematic differences in phenotype documentation across cohorts, institutions, or clinicians. We aim to develop a scalable and statistically principled framework to address these limitations for rare disease prediction and patient stratification. Methods: We developed PhenoSS, a Gaussian copula-based framework that models disease-specific marginal prevalence of HPO terms while capturing their joint dependencies through a multivariate normal distribution. Phenotype frequencies were estimated using external curated resources, including OARD (Open Annotations for Rare Diseases) and HPO annotations. PhenoSS supports both pair-wise phenotype similarity calculation for patient clustering and posterior odds estimation for patient-specific disease prioritization. A batch-effect correction module mitigates systematic phenotyping differences across datasets. Results: Across diverse simulation scenarios, PhenoSS demonstrated robust disease-prediction performance and consistently improved accuracy after batch-effect correction. In real electronic health record (EHR) data, PhenoSS identified clinically meaningful patient clusters and effectively distinguished patients with different rare diseases. In disease prioritization tasks, PhenoSS achieved competitive performance with existing methods, particularly for patients exhibiting sparse or noisy phenotype annotations. Conclusion: PhenoSS provides a statistically interpretable framework for modeling phenotypic heterogeneity in rare disease research and is adaptable to other structured clinical vocabularies such as SNOMED-CT and ICD codes.

15
Long-read metagenomics and methylation-based binning allow the description of the emerging high-risk antibiotic resistance genes and their hidden hosts in complex communities

Markkanen, M.; Putkuri, H.; Kiciatovas, D.; Mustonen, V.; Virta, M.; Karkman, A.

2026-02-22 public and global health 10.64898/2026.02.18.26346558
Top 1%
11× avg

Antibiotic resistance genes (ARGs) circulating among clinically relevant bacteria pose serious challenges to public health. Given the ancient and environmental bacterial origins of ARGs, a better understanding of the carriers of ARGs beyond the clinically most relevant species is urgently needed for more farsighted resistance monitoring and intervention measures. While the risks of emerging ARGs from environmental sources have been recognized, the identification bottlenecks stem from the limitations of shotgun metagenomic sequencing and bioinformatic methods. Here, we used long-read metagenomic sequencing and bacteria-specific methylation profiles to re-establish the links between established (well-described) or latent (absent in databases) ARGs and their bacterial and genetic contexts in wastewater. The base modification data produced by PacBio SMRT sequencing was analyzed by an in-house pipeline utilizing position weight matrices and UMAP visualizations. The approach was validated by a synthetic community with known bacterial composition. Our analysis revealed several previously unreported ARGs and their hosts with varying risk levels defined by their potential as emerging public health threats. For instance, Arcobacter, as one of the prevalent taxa in influent wastewater, was shown to carry a latent beta-lactamase gene with high predicted mobility potential. Of the other emerging beta-lactamases, we provided a real-life example of ongoing pdif module-mediated genetic reshuffling of the blaMCA gene occurring at least within Acinetobacter hosts in our samples. Additionally, we identified Simplicispira, Phycisphaerae, and environmental groups of the Bacteroidales order as the carriers of established, clinically important ARGs. 
These findings support the intermediate host roles of strictly environmental bacteria for the further dissemination of mobilized ARGs, highlighting the importance of exploring the uncultivated, or non-pathogenic, carriers of ARGs for the early detection of newly arising ARGs and mobility mechanisms.

16
Statistical uncertainty explains the poor agreement in polygenic scoring for type 2 diabetes

Mandla, R.; Li, X.; Shi, Z.; Abramowitz, S.; Lapinska, S.; Penn Medicine Biobank; Levin, M. G.; Damrauer, S. M.; Pasaniuc, B.

2026-02-27 genetic and genomic medicine 10.64898/2026.02.25.26347015
Top 1%
11× avg

Polygenic scores (PGS) have emerged as an important tool for genetic risk prediction in medicine, used to identify individuals at high risk for disease. A major limitation in their implementation is the apparent disagreement among scores for the same individual, which decreases their interpretability and utility in clinical settings. Here we show that the poor agreement across PGSes for type 2 diabetes (T2D) is fully explained by statistical uncertainty in PGS-based prediction; individual-level uncertainty estimates from a single PGS explain the variability across existing PGSes. We provide an approach for the selection of high-risk individuals that incorporates measures of uncertainty and show that individuals classified as high risk with high confidence, based on their PGS uncertainty, have higher risk agreement across existing PGSes and are more likely to develop T2D than individuals classified as high risk based only on PGS point estimates. Together, these findings shed light on the factors underlying a roadblock in PGS implementation and underscore the need to incorporate uncertainty in PGS-based predictions.
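
The abstract above does not specify how uncertainty enters the selection of high-risk individuals, so the following is only a minimal sketch under stated assumptions. It assumes each individual has a PGS point estimate and a standard deviation, treats the true score as normally distributed around the estimate, and keeps only individuals whose probability of exceeding a risk threshold is at least `min_prob`. All names, values, and the 0.95 cutoff are hypothetical.

```python
import math

def prob_above(mean, sd, threshold):
    """P(true score > threshold) under a normal approximation (an assumption)."""
    if sd <= 0:
        return 1.0 if mean > threshold else 0.0
    z = (mean - threshold) / sd
    # Normal CDF evaluated at z, written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def confident_high_risk(scores, threshold, min_prob=0.95):
    """Return IDs whose probability of exceeding the threshold is >= min_prob."""
    return [
        pid
        for pid, (mean, sd) in scores.items()
        if prob_above(mean, sd, threshold) >= min_prob
    ]

# Hypothetical individuals: (point estimate, standard deviation).
scores = {"A": (2.0, 0.2), "B": (1.6, 1.0), "C": (0.5, 0.3)}
threshold = 1.5

# Selection on point estimates alone versus selection that uses uncertainty.
point_estimate_calls = [pid for pid, (m, _) in scores.items() if m > threshold]
confident_calls = confident_high_risk(scores, threshold)
```

In this toy example, individuals A and B both exceed the threshold on point estimates, but only A remains once the wide uncertainty around B's score is taken into account.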

17
A spatial multi-omic portrait of survival outcome for clear cell renal cell carcinoma

Meyer, L.; Engler, S.; Lutz, M.; Schraml, P.; Rutishauser, D.; Bertolini, A.; Lienhard, M.; Beisel, C.; Singer, F.; De Souza, N.; Beerenwinkel, N.; Moch, H.; Bodenmiller, B.

2026-03-04 oncology 10.64898/2026.03.02.26347390
Top 1%
11× avg

Clear cell renal cell carcinoma (ccRCC) is the leading cause of kidney cancer-related death, but how the tumor microenvironment shapes patient survival is not completely understood. Here, we describe the characterization of ccRCC tumor ecosystems from 498 patients using imaging mass cytometry with a focus on tumor, myeloid, and T cell landscapes. Data from more than 3 million single cells are analyzed using machine learning to identify key ecosystem features that outperform basic clinical data for predicting patient survival. We define three survival ecotypes of ccRCC. Poor ecotypes, which correlate with the worst survival, have high levels of ICAM1 and CD44 expression in tumor cells and are enriched in M2-like macrophages and interactions of exhausted CD8+ T cells with macrophages. Favorable ecotypes are characterized by high levels of VHL on tumor cells and of HLADR on myeloid cells and contain Th1-like CD4+ T cells. Medium ecotypes have the highest endothelial cell density and various immune-to-tumor interactions. Multi-omic characterization of these ecotypes using targeted genomic sequencing and metabolic imaging reveals distinct genomic and metabolic features, including BAP1 mutations in Poor and VHL monodriver/wild-type status in Favorable patients. We show that deep learning allows ecotype prediction directly from standard pathology H&E images. We validate the ecotypes and their associated molecular characteristics with orthogonal omics data across five clinical cohorts and more than 2,500 patients. These analyses highlight an overall survival benefit for Medium patients treated with immunotherapy. In summary, our study distills the survival-relevant information encoded in the ccRCC tumor microenvironment into prognostic survival ecotypes, which may inform clinical decision making in the future.

18
Multimodal AI fuses proteomic and EHR data for rational prioritization of protein biomarkers in diabetic retinopathy

Lin, J. B.; Mataraso, S. J.; Chadha, M.; Velez, G.; Mruthyunjaya, P.; Aghaeepour, N.; Mahajan, V. B.

2026-02-24 ophthalmology 10.64898/2026.02.23.26346903
Top 2%
11× avg

Purpose: There is a need for novel therapies for diabetic retinopathy (DR) because existing therapies treat only certain features of DR and do not work optimally for all patients. While proteomic studies provide insight into disease pathobiology, they are often limited to small sample sizes due to high costs, limiting their generalizability and reproducibility. Moreover, they often yield lists of tens to hundreds of proteins with differential expression, making it difficult to prioritize the most biologically relevant biomarkers beyond using arbitrary fold-change and false-detection rate cutoffs. Here, we applied a two-stage multimodal AI approach: first, we integrated EHR and proteomics data to rationally prioritize candidate protein biomarkers and, next, validated these biomarkers in an independent cohort. These protein biomarkers of DR are rooted in the EHR data and thereby more likely to be biological drivers of disease. Methods: We obtained EHR data from a large number of patients with and without DR (N=319,997) from the STARR-OMOP database and obtained aqueous humor liquid biopsies from a subset of these patients (N=101) for high-resolution proteomic profiling. We developed Clinical and Omics Multi-Modal Analysis Enhanced with Transfer Learning (COMET) to perform integrated analysis of proteomics and all available EHR data to identify protein biomarkers of DR. The model was trained in two phases: first, it was pretrained using patients with EHR data alone (N=319,896), and then it was fine-tuned using patients with both EHR and proteomics data (N=101), allowing it to learn both clinical and molecular features associated with DR. Findings from COMET were then validated with liquid biopsies from an independent validation cohort (N=164). Results: t-distributed stochastic neighbor embedding (t-SNE) analysis of EHR and proteomics data identified proteins clustering with related EHR features. 
Levels of STX3 and NOTCH2, proteins involved in retinal function, were correlated with a diagnosis of macular edema, a record of a visual field exam, and a prescription for latanoprost, highlighting protein-EHR alignment. The pretrained, multimodal COMET model (AUROC=0.98, AUPRC=0.91) outperformed models generated using either EHR or proteomics data alone or without pretraining (AUROC: 0.76 to 0.92; AUPRC: 0.47 to 0.74). The proteins SERPINE1, QPCT, AKR1C2, IL2RB, and SRSF6 were prioritized by the COMET model compared to the models without pretraining, supporting their potential role in DR pathobiology, and were subsequently validated in an independent cohort. Conclusion: We used multimodal AI to prioritize protein biomarkers of DR that are most strongly linked to EHR elements and to identify other protein biomarkers associated with disease features like diabetic macular edema. These findings serve as a foundation for future mechanistic studies and highlight the synergistic value of using multimodal AI to fuse EHR and proteomics data for enhanced proteomics analysis.

19
Deep Agentic Variant Prioritisation for Expert Level Genetic Diagnosis Fast at Scale

Kara, M.; Gungor, A. F.; Kuday, S. E.; Ozcelik, O.; Ozden, F.

2026-02-18 genetic and genomic medicine 10.64898/2026.02.17.26346421
Top 2%
11× avg

Genetic diagnosis remains a formidable challenge characterized by a diagnostic odyssey that spans years: rare diseases affect more than 300 million people worldwide, and over half of rare disease patients remain undiagnosed. Clinicians must navigate through thousands of candidate variants against a noisy and fragmented literature landscape, a task that overwhelms human cognitive capacity and conventional decision-making approaches. Recent advances in agentic artificial intelligence systems have demonstrated superior performance in complex, multi-step reasoning tasks by systematically evaluating vast amounts of information, breaking down problems into manageable components, and adapting dynamically to new evidence. These capabilities align precisely with the requirements of genetic variant prioritization. Here we present DAVP (Deep Agentic Variant Prioritisation), a hierarchical agentic AI system that represents a major step forward in genetic diagnosis through patient-specific variant evaluation. Unlike traditional approaches that apply generic pathogenicity scores, DAVP evaluates each variant within the full context of the patient's clinical presentation, phenotypic profile, and genomic background. The system comprises three interconnected algorithmic components: prelimin8, a gene pre-screening algorithm that rapidly filters the genomic search space; inGeneTopMatch, a semantic knowledge graph algorithm that captures complex gene-phenotype-disease relationships; and elimin8, an in-context learning prioritization algorithm that dynamically ranks variants through iterative knowledge sorting and evidence synthesis. We conducted comprehensive benchmarks measuring diagnostic cumulative distribution function (CDF) recall based on top-k variant recommendations using simulation cases constructed with 1000 Genomes as healthy background genomes and variants from ClinVar as positive controls. 
DAVP demonstrates strong diagnostic performance superior to expert genetic clinicians while operating at orders of magnitude greater speed and scale. Our results demonstrate that agentic AI systems can transform rare disease diagnostics by combining the systematic evaluation capabilities of artificial intelligence with the nuanced clinical reasoning required for complex genetic diagnosis. This work lays the foundation for a new paradigm in AI-driven genetic medicine that could accelerate diagnosis, reduce healthcare costs, and improve patient outcomes worldwide. The source code and data to reproduce this work are available at https://github.com/Muti-Kara/davp.
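
The top-k recall metric named in the abstract above can be sketched as follows. This is a generic illustration, not code from the DAVP repository: it assumes each simulated case yields the 1-based rank of the true causal variant (or `None` if it was not ranked), and the function names and example ranks are hypothetical.

```python
def top_k_recall(ranks, k):
    """Fraction of cases whose causal variant is ranked within the top k.

    `ranks` holds 1-based ranks; None means the variant was not ranked.
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)

def recall_curve(ranks, max_k):
    """Recall at k = 1..max_k, i.e. an empirical CDF over causal-variant ranks."""
    return [top_k_recall(ranks, k) for k in range(1, max_k + 1)]

# Hypothetical ranks of the causal variant in five simulated cases.
example_ranks = [1, 3, 2, None, 7]
curve = recall_curve(example_ranks, 5)
```

The curve is non-decreasing in k, so comparing two prioritization methods amounts to comparing how quickly their curves rise.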

20
Gene to Morphology Alignment via Graph Constrained Latent Modeling for Molecular Subtype Prediction from Histopathology in Pancreatic Cancer

Leyva, A.; Akbar, A.; Niazi, K.

2026-03-06 oncology 10.64898/2026.03.05.26347711
Top 2%
11× avg

Molecular subtyping of cancer is traditionally defined in transcriptomic space, yet routine clinical deployment is limited by the availability and cost of sequencing. Meanwhile, histopathology captures rich morphological information that is known to correlate with molecular state but lacks a principled, mechanistic bridge to gene-level representations. We propose a graph-constrained learning framework that aligns morphology-derived signals with a fixed, data-driven gene network discovered via hierarchical Monte Carlo screening. We derive new gene sets for classification using random sampling and use the coexpression network of that graph to enforce the learning of a pure morphology model without using gene expression. The resulting model performs subtype prediction using morphology alone, while being explicitly forced to operate through a gene-structured latent space. Structural alignment is enforced during training. For Moffitt classification in pancreatic cancer using PANCAN and TCGA datasets, the model has a reported AUC of 85% using an alternative gene set network structure, while the alternative gene set itself has an AUC of 84% across all pancreatic cancer patients in the dataset with an assigned subtype. This demonstrates that virtual transcriptomics can provide biologically grounded molecular insights using only routine histopathology slides, potentially expanding access to precision oncology in resource-limited settings.